A Framework for Text Processing and Supporting Access to Collections of Digitized Historical Newspapers
Internal identifier: 000F12 (Main/Exploration); previous: 000F11; next: 000F13
Authors: B. Allen [United States]; Andrea Japzon [United States]; Palakorn Achananuparp [United States]; Jung Lee [United States]
Source:
- Lecture Notes in Computer Science [0302-9743]; 2007.
Abstract
Abstract: Large quantities of historical newspapers are being digitized and OCRd. We describe a framework for processing the OCRd text to identify articles and extract metadata for them. We describe the article schema and provide examples of features that facilitate automatic indexing of them. For this processing, we employ lexical semantics, structural models, and community content. Furthermore, we describe visualization and summarization techniques that can be used to present the extracted events.
DOI: 10.1007/978-3-540-73354-6_26
Links toward previous steps (curation, corpus...)
- to stream Istex, to step Corpus: 000928
- to stream Istex, to step Curation: 000918
- to stream Istex, to step Checkpoint: 000927
- to stream Main, to step Merge: 000F25
- to stream Main, to step Curation: 000F12
The document in XML format
<record><TEI wicri:istexFullTextTei="biblStruct"><teiHeader><fileDesc><titleStmt><title xml:lang="en">A Framework for Text Processing and Supporting Access to Collections of Digitized Historical Newspapers</title>
<author><name sortKey="Allen, B" sort="Allen, B" uniqKey="Allen B" first="B." last="Allen">B. Allen</name>
</author>
<author><name sortKey="Japzon, Andrea" sort="Japzon, Andrea" uniqKey="Japzon A" first="Andrea" last="Japzon">Andrea Japzon</name>
</author>
<author><name sortKey="Achananuparp, Palakorn" sort="Achananuparp, Palakorn" uniqKey="Achananuparp P" first="Palakorn" last="Achananuparp">Palakorn Achananuparp</name>
</author>
<author><name sortKey="Lee, Jung" sort="Lee, Jung" uniqKey="Lee J" first="Jung" last="Lee">Jung Lee</name>
</author>
</titleStmt>
<publicationStmt><idno type="wicri:source">ISTEX</idno>
<idno type="RBID">ISTEX:339038902DE6BB0B9E4B4EE9271F38ADA614AC7A</idno>
<date when="2007" year="2007">2007</date>
<idno type="doi">10.1007/978-3-540-73354-6_26</idno>
<idno type="url">https://api.istex.fr/document/339038902DE6BB0B9E4B4EE9271F38ADA614AC7A/fulltext/pdf</idno>
<idno type="wicri:Area/Istex/Corpus">000928</idno>
<idno type="wicri:Area/Istex/Curation">000918</idno>
<idno type="wicri:Area/Istex/Checkpoint">000927</idno>
<idno type="wicri:doubleKey">0302-9743:2007:Allen B:a:framework:for</idno>
<idno type="wicri:Area/Main/Merge">000F25</idno>
<idno type="wicri:Area/Main/Curation">000F12</idno>
<idno type="wicri:Area/Main/Exploration">000F12</idno>
</publicationStmt>
<sourceDesc><biblStruct><analytic><title level="a" type="main" xml:lang="en">A Framework for Text Processing and Supporting Access to Collections of Digitized Historical Newspapers</title>
<author><name sortKey="Allen, B" sort="Allen, B" uniqKey="Allen B" first="B." last="Allen">B. Allen</name>
<affiliation wicri:level="2"><country xml:lang="fr">États-Unis</country>
<placeName><region type="state">Pennsylvanie</region>
</placeName>
<wicri:cityArea>College of Information Science and Technology, Drexel University Philadelphia</wicri:cityArea>
</affiliation>
<affiliation wicri:level="1"><country wicri:rule="url">États-Unis</country>
</affiliation>
</author>
<author><name sortKey="Japzon, Andrea" sort="Japzon, Andrea" uniqKey="Japzon A" first="Andrea" last="Japzon">Andrea Japzon</name>
<affiliation wicri:level="2"><country xml:lang="fr">États-Unis</country>
<placeName><region type="state">Pennsylvanie</region>
</placeName>
<wicri:cityArea>College of Information Science and Technology, Drexel University Philadelphia</wicri:cityArea>
</affiliation>
<affiliation wicri:level="1"><country wicri:rule="url">États-Unis</country>
</affiliation>
</author>
<author><name sortKey="Achananuparp, Palakorn" sort="Achananuparp, Palakorn" uniqKey="Achananuparp P" first="Palakorn" last="Achananuparp">Palakorn Achananuparp</name>
<affiliation wicri:level="2"><country xml:lang="fr">États-Unis</country>
<placeName><region type="state">Pennsylvanie</region>
</placeName>
<wicri:cityArea>College of Information Science and Technology, Drexel University Philadelphia</wicri:cityArea>
</affiliation>
<affiliation wicri:level="1"><country wicri:rule="url">États-Unis</country>
</affiliation>
</author>
<author><name sortKey="Lee, Jung" sort="Lee, Jung" uniqKey="Lee J" first="Jung" last="Lee">Jung Lee</name>
<affiliation wicri:level="2"><country xml:lang="fr">États-Unis</country>
<placeName><region type="state">Pennsylvanie</region>
</placeName>
<wicri:cityArea>College of Information Science and Technology, Drexel University Philadelphia</wicri:cityArea>
</affiliation>
<affiliation wicri:level="1"><country wicri:rule="url">États-Unis</country>
</affiliation>
</author>
</analytic>
<monogr></monogr>
<series><title level="s">Lecture Notes in Computer Science</title>
<imprint><date>2007</date>
</imprint>
<idno type="ISSN">0302-9743</idno>
<idno type="eISSN">1611-3349</idno>
<idno type="ISSN">0302-9743</idno>
</series>
<idno type="istex">339038902DE6BB0B9E4B4EE9271F38ADA614AC7A</idno>
<idno type="DOI">10.1007/978-3-540-73354-6_26</idno>
<idno type="ChapterID">26</idno>
<idno type="ChapterID">Chap26</idno>
</biblStruct>
</sourceDesc>
<seriesStmt><idno type="ISSN">0302-9743</idno>
</seriesStmt>
</fileDesc>
<profileDesc><textClass></textClass>
<langUsage><language ident="en">en</language>
</langUsage>
</profileDesc>
</teiHeader>
<front><div type="abstract" xml:lang="en">Abstract: Large quantities of historical newspapers are being digitized and OCRd. We describe a framework for processing the OCRd text to identify articles and extract metadata for them. We describe the article schema and provide examples of features that facilitate automatic indexing of them. For this processing, we employ lexical semantics, structural models, and community content. Furthermore, we describe visualization and summarization techniques that can be used to present the extracted events.</div>
</front>
</TEI>
<affiliations><list><country><li>États-Unis</li>
</country>
<region><li>Pennsylvanie</li>
</region>
</list>
<tree><country name="États-Unis"><region name="Pennsylvanie"><name sortKey="Allen, B" sort="Allen, B" uniqKey="Allen B" first="B." last="Allen">B. Allen</name>
</region>
<name sortKey="Achananuparp, Palakorn" sort="Achananuparp, Palakorn" uniqKey="Achananuparp P" first="Palakorn" last="Achananuparp">Palakorn Achananuparp</name>
<name sortKey="Achananuparp, Palakorn" sort="Achananuparp, Palakorn" uniqKey="Achananuparp P" first="Palakorn" last="Achananuparp">Palakorn Achananuparp</name>
<name sortKey="Allen, B" sort="Allen, B" uniqKey="Allen B" first="B." last="Allen">B. Allen</name>
<name sortKey="Japzon, Andrea" sort="Japzon, Andrea" uniqKey="Japzon A" first="Andrea" last="Japzon">Andrea Japzon</name>
<name sortKey="Japzon, Andrea" sort="Japzon, Andrea" uniqKey="Japzon A" first="Andrea" last="Japzon">Andrea Japzon</name>
<name sortKey="Lee, Jung" sort="Lee, Jung" uniqKey="Lee J" first="Jung" last="Lee">Jung Lee</name>
<name sortKey="Lee, Jung" sort="Lee, Jung" uniqKey="Lee J" first="Jung" last="Lee">Jung Lee</name>
</country>
</tree>
</affiliations>
</record>
To manipulate this document under Unix (Dilib)
EXPLOR_STEP=$WICRI_ROOT/Ticri/CIDE/explor/OcrV1/Data/Main/Exploration
HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 000F12 | SxmlIndent | more
Or
HfdSelect -h $EXPLOR_AREA/Data/Main/Exploration/biblio.hfd -nk 000F12 | SxmlIndent | more
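Once the record has been extracted, its fields can be read with any XML parser. The following is a minimal Python sketch, not part of Dilib, that pulls the title, authors, and DOI from a trimmed copy of the record shown above. Two things are assumptions made for illustration: the helper name `parse_record`, and the namespace URI `urn:example:wicri`, which is a placeholder. The record as shown uses the `wicri:` prefix without declaring it, so the sketch binds the prefix on the root element before parsing.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the record shown above; a real run would read the
# output of the HfdSelect command instead (for example from standard input).
RECORD = """<record><TEI wicri:istexFullTextTei="biblStruct"><teiHeader><fileDesc><titleStmt>
<title xml:lang="en">A Framework for Text Processing and Supporting Access to Collections of Digitized Historical Newspapers</title>
<author><name sortKey="Allen, B" first="B." last="Allen">B. Allen</name></author>
<author><name sortKey="Japzon, Andrea" first="Andrea" last="Japzon">Andrea Japzon</name></author>
</titleStmt>
<publicationStmt><idno type="doi">10.1007/978-3-540-73354-6_26</idno></publicationStmt>
</fileDesc></teiHeader></TEI></record>"""

# Placeholder namespace URI (an assumption, not an official Wicri URI).
WICRI_NS = "urn:example:wicri"


def parse_record(xml_text):
    """Return the title, author names, and DOI of one record (hypothetical helper)."""
    # The wicri: prefix is used but never declared in the record, which
    # ElementTree rejects as an unbound prefix; declare it on the root first.
    patched = xml_text.replace("<record>", f'<record xmlns:wicri="{WICRI_NS}">', 1)
    root = ET.fromstring(patched)
    stmt = root.find("./TEI/teiHeader/fileDesc/titleStmt")
    return {
        "title": stmt.findtext("title"),
        "authors": [name.text for name in stmt.findall("author/name")],
        "doi": root.findtext(".//publicationStmt/idno[@type='doi']"),
    }


if __name__ == "__main__":
    print(parse_record(RECORD))
```

Reading from the `titleStmt` element rather than searching the whole tree avoids counting each author twice, since the full record repeats the author names under `sourceDesc` and again in the `affiliations` tree.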
To add a link to this page within the Wicri network
{{Explor lien |wiki= Ticri/CIDE |area= OcrV1 |flux= Main |étape= Exploration |type= RBID |clé= ISTEX:339038902DE6BB0B9E4B4EE9271F38ADA614AC7A |texte= A Framework for Text Processing and Supporting Access to Collections of Digitized Historical Newspapers }}
This area was generated with Dilib version V0.6.32.